The concept of "true score" lies at the heart of much of
classical mental test theory and, as mentioned in the previous paper in
this session (Finucci (1971)), is the basis of the derivation of
"attenuation theory" (formulas which correct correlation coefficients
for the perturbing effects of errors of measurement). So much a part of the
thinking of mental test specialists has the concept of "true score" become
that the intuitions and consequences that can be derived from such a
concept are frequently applied in situations where neither the "true score
plus error" model nor the conclusions resulting from that model are
applicable. In particular, misapplication of the "true score" concept
seems to be behind the commonly held opinion that test validity can be
increased by increasing test or item reliability. This opinion was shown
by Loevinger (1954) to be false in a certain statistical model useful in
item analysis of the dichotomously-scored items found in many aptitude tests.
Loevinger (1954) named the assertion which she verified in her paper "the
attenuation paradox".

Presented at the American Educational Research Association 55th
Annual Meeting, New York City, February 4-7, 1971.
The Concise Oxford Dictionary defines "paradox" as a "statement
contrary to received opinion ... seemingly absurd though perhaps really
well-founded ..., conflicting with preconceived notions of what is
reasonable or possible". Loevinger's (1954) "attenuation paradox" asserts
that it is possible to "attenuate" or reduce test validity with an increase
in test and/or average item reliability. Although the word "attenuation"
appears both in "attenuation paradox" and in "correction for attenuation", the
connection between these concepts lies not so much in their use of a common
word as in the warning the "paradox" should give to practitioners
who naively use the "correction for attenuation" formulas in inappropriate
statistical contexts.

In the present paper, we first try to indicate why the concept of
"true score" naturally leads to the belief that test validity must increase
with an increase in test and/or average item reliability, and why for the
classical single-factor model first introduced by Spearman (1904a) this belief
is, in fact, correct. Next, we introduce the statistical model used by
Loevinger (1954) to establish the "attenuation paradox", and in intuitive
terms attempt to explain why the "attenuation paradox" holds in this
particular model. We do this by showing that high (internal) consistency or
reliability of test scores is an asset in increasing test validity under the
classical single-factor statistical model for mental tests, but can be a
liability when item scores are modelled as in the statistical model discussed
by Loevinger. It is hoped that by this exposition, mental test specialists
will be led to more critical appraisal of commonly used techniques and
concepts (including the "corrections for attenuation"), and will check that
their methods of test construction and comparison are consistent with their
statistical models.
2. THE CLASSICAL MODEL

A central aim of mental test construction is to find a test which
assesses with maximal accuracy the extent or level to which a given mental trait is
possessed by an individual or individuals. In the classical statistical
model of Spearman (1904a,b), it is assumed that the level of the mental
trait in question can be measured by a single variable Y. Values of Y are
assumed to have some probability distribution over that population of
individuals which is of interest. Without essential loss of generality for
this theoretical discussion, we may assume that Y has a mean (expectation)
of zero and a variance of one.

If we could observe Y without error, there would, of course,
be no need for a theory of mental tests (at least insofar as this theory refers
to test construction). However, in general the trait level Y is not directly
observable - it is latent. What we observe are scores $X_1, X_2, \ldots, X_N$
on N items. These items (sub-tests, questions, reaction times, etc.) are assumed
to be statistically related to Y in that each item score individually can
be used to predict or estimate Y by means of statistical regression
techniques. For a given individual (given value of Y), it is assumed that
the item scores $X_1, X_2, \ldots, X_N$ are (conditionally) statistically
independent, and that given Y, the $i$-th item score $X_i$ has (conditional)
mean Y and (conditional) variance $\sigma_i^2$, $i = 1, 2, \ldots, N$.

The above assumptions relating the item scores $X_1, X_2, \ldots, X_N$ and
Y are equivalent to a single-factor statistical model for the item scores,
with Y as the common factor and each item score $X_i$ having equal factor
loading on Y. Consequently, we can assert that

(1)    $X_i = Y + E_i, \qquad i = 1, 2, \ldots, N,$

where $Y, E_1, E_2, \ldots, E_N$ are statistically independent, each $E_i$ has
mean equal to zero, and the variance of $E_i$ is $\sigma_i^2$,
$i = 1, 2, \ldots, N$. The model (1) is a "true score plus error" model for
the item scores, and in this model, Y is the "true score".

To justify this model empirically it has been necessary for
Spearman, Gulliksen (1950), and other mental test theorists to conceive of
each item as being replicable on the same individual in such a way that the
item scores $X_i$ and $X_i'$ on the $i$-th item and its replication are associated
only through the fact that an individual brings the same mental trait level
Y to bear on the replicated items. Stated statistically, these theorists have
had to assume that an item could be paired with a supposedly parallel or
identical item in such a way that the resulting item scores have the
representation

(2)    $X_i = Y + E_i, \qquad X_i' = Y + E_i',$

where $Y$, $E_i$, $E_i'$ are independent, and $E_i$ and $E_i'$ have the same
distribution. (Thus, $E_i$ and $E_i'$ both have mean zero and variance
$\sigma_i^2$.) Such assumptions
are open to criticism, both in terms of the circularity in definition
required to operationally define parallel items (see Loevinger (1947, 1957),
Ross and Lumsden (1968)), and in terms of difficulty of practical application
(see Finucci (1971)). However, if accepted, these assumptions imply that if
we could infinitely replicate an item, the average of the resulting item scores
would equal Y. Hence, it seems that by maximizing internal consistency,
we can almost perfectly estimate Y by choosing a test having a large
enough collection of replicated items.
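
The limiting claim above can be checked numerically. The following Python
sketch is illustrative only: it assumes normally distributed errors (model (2)
fixes only the mean and variance of the errors), and the function name, the
trait level 0.7 and the error standard deviation 1.5 are hypothetical choices.

```python
import random
import statistics

def average_of_replicates(y, sigma, n_replicates, rng):
    """Average n_replicates scores X = y + E, with E drawn as N(0, sigma^2).

    Normal errors are an assumption made for illustration; model (2)
    only requires mean zero and a common variance.
    """
    scores = [y + rng.gauss(0.0, sigma) for _ in range(n_replicates)]
    return statistics.fmean(scores)

rng = random.Random(0)      # fixed seed so the run is repeatable
y_true = 0.7                # hypothetical trait level of one individual
for n in (1, 10, 100, 10_000):
    # The average should settle near y_true as n grows.
    print(n, round(average_of_replicates(y_true, 1.5, n, rng), 3))
```

The standard deviation of the average is sigma divided by the square root of
the number of replicates, which is why the printed averages tighten around the
trait level as the number of replicates grows.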
Item analysis aims at choosing items in such a way that maximal
test validity is achieved with a minimal set of items. A mental test is thus
a choice of items from a certain item pool of N items. Let T denote the
list of indices of items chosen (for example, T might equal {1, 3, 9, 10, 12}).
If the $i$-th item is used in our test, we write $i \in T$. The test score T is
the sum of item scores over all items in the test; hence
$T = \sum_{i \in T} X_i$.

If there are n items, $n \le N$, in the test, then

(3)    $\frac{1}{n} T = \frac{1}{n} \sum_{i \in T} X_i = Y + \frac{1}{n} \sum_{i \in T} E_i = Y + E.$

Hence T/n also can be written in "true-score-plus-error" form. The
"error" here is $E = \frac{1}{n} \sum_{i \in T} E_i$, which is statistically
independent of Y, and has mean 0 and variance
$\sigma^2 = \frac{1}{n^2} \sum_{i \in T} \sigma_i^2$.

To measure the accuracy with which the test score serves as an
estimate of Y (i.e., the validity of the test), for theoretical purposes we
may use the Pearson product-moment correlation coefficient $\rho_{TY}$ between T
and Y. Using formula (3) for T,

(4)    $\rho_{TY} = \rho_{(T/n)Y} = (1 + \sigma^2)^{-1/2}.$

Since Y is unobservable, we cannot in practice estimate $\rho_{TY}$ directly.
Various sample measures of validity do exist (split-half validity, correlation
with another test presumed to measure Y, etc.), but these are fairly
difficult to obtain in most cases. However, from formula (3) we see that
T/n differs from Y by an error term E which is independent of Y and
has mean 0 and variance $\sigma^2$. As $\sigma^2$ becomes smaller, E becomes less
and less variable about its mean of 0. Thus $\rho_{TY}$ is justified as a measure
of accuracy, since $\rho_{TY}^2 = 1/(1 + \sigma^2)$ increases to 1 as $\sigma^2$
converges to 0 (see formula (4)). On the other hand, if we conceive of
replicating each of the items in the test as in (2), we can think of a
replicated test with test score $T' = \sum_{i \in T} X_i' = n(Y + E')$. We can
thus measure the consistency, precision, or reliability of our test score by
seeing how well T can predict its replicate T' (remember that both replicates
are given to each individual). A measure of this predictability is the
product-moment correlation $\rho_{TT'}$, which equals

(5)    $\rho_{TT'} = \rho_{(T/n)(T'/n)} = \frac{1}{1 + \sigma^2}.$

The fact that the reliability $\rho_{TT'}$ is inversely related to the variance
$\sigma^2$ is intuitively obvious when we note that $T/n = T'/n + E - E'$. Since
E and E' are independent, the variance of $E - E'$ is $2\sigma^2$. Hence the
smaller the variation in the error term E (and its replicate E') is, the
better able we are to predict T from T' (or T' from T). As $\sigma^2$ goes to 0,
$\rho_{TT'}$ increases to 1 (see (5)), as is proper for a measure of reliability.
Comparing formulas (4) and (5), we see that

(6)    $\rho_{TY} = \sqrt{\rho_{TT'}}.$

Consequently, we have verified mathematically that in the classical Spearman
single-factor model, test validity increases monotonically with test
reliability. However, this direct tie between test validity and test
reliability occurs because in the "true score plus error" model satisfied by
the test score T, the error term doubles as both an indicator of how
accurately T measures Y and as an indicator of how repeatable the test score
T is when the test is replicated on the same individual.
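
The relation between formulas (4), (5) and (6) can be traced with a few lines
of Python. This is a minimal sketch under the classical model only; the
function names and the list of item error variances are hypothetical and
chosen purely for illustration.

```python
import math

def error_variance_of_mean(item_error_vars):
    """sigma^2 = (1/n^2) * sum of the item error variances (variance of E in (3))."""
    n = len(item_error_vars)
    return sum(item_error_vars) / n**2

def validity(item_error_vars):
    """rho_TY from (4): correlation between the test score and Y."""
    return 1.0 / math.sqrt(1.0 + error_variance_of_mean(item_error_vars))

def reliability(item_error_vars):
    """rho_TT' from (5): correlation between the test and its replicate."""
    return 1.0 / (1.0 + error_variance_of_mean(item_error_vars))

item_vars = [0.5, 1.0, 2.0, 4.0]   # hypothetical item error variances
# Relation (6): the two printed numbers agree.
print(validity(item_vars), math.sqrt(reliability(item_vars)))
```

Because both quantities depend on the items only through the single number
$\sigma^2$, anything that lowers $\sigma^2$ raises validity and reliability
together in this model.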
Before leaving the classical model, we pause to point out that
test construction is often done by choosing those n items from the item
pool which have maximum item reliabilities $\rho_{X_i X_i'}$. This practice of
judging a choice of items solely by item reliabilities, rather than also by
consideration of the correlations $\rho_{X_i X_j}$ between items (as would be
necessary in multiple regression), is also a consequence of the classical
"true-score-plus-error" model (1). Indeed if we calculate $\rho_{X_i X_i'}$ and
$\rho_{X_i X_j}$, we find that

(7)    $\rho_{X_i X_i'} = \frac{1}{1 + \sigma_i^2}, \qquad \rho_{X_i X_j} = \frac{1}{\sqrt{1 + \sigma_i^2}\,\sqrt{1 + \sigma_j^2}},$

so that $\rho_{X_i X_j} = \sqrt{\rho_{X_i X_i'}\,\rho_{X_j X_j'}}$. The extremely
tight correlational structure revealed by this last mathematical result is not
surprising, of course, when we recall that our model is a single-factor model.
Remembering that $n^2 \sigma^2 = \sum_{i \in T} \sigma_i^2$, and making use of
(5) and (7), we find that the test reliability $\rho_{TT'}$ is a function solely
of the item reliabilities; namely,

(8)    $\rho_{TT'} = \frac{n}{(n-1) + \frac{1}{n} \sum_{i \in T} \left(1/\rho_{X_i X_i'}\right)}.$

From formula (8), we see that if we want to choose the best test consisting
of n items, we should choose the n items having highest reliability in our
pool of items. Approximating the harmonic mean
$\left[\frac{1}{n} \sum_{i \in T} \left(1/\rho_{X_i X_i'}\right)\right]^{-1}$ by the
arithmetic mean

(9)    $\bar{\rho} = n^{-1} \sum_{i \in T} \rho_{X_i X_i'}$

in (8), we can only decrease the denominator of (8) (the harmonic mean never
exceeds the arithmetic mean), and so obtain

(10)    $\rho_{TT'} \le \frac{n}{(n-1) + 1/\bar{\rho}} = \frac{n \bar{\rho}}{(n-1)\bar{\rho} + 1}.$

The right side of this inequality is, of course, the Spearman-Brown prophecy
formula (see Finucci (1971)) commonly used for assessing test reliability.
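
A small numerical comparison may help here. The sketch below takes formula (8)
in the form n / ((n - 1) + (1/n) * sum of 1/rho_i), which is what follows from
(5) and (7) under model (1), and compares it with the Spearman-Brown value
computed from the arithmetic mean reliability (9). The function names and the
two item reliabilities are hypothetical.

```python
def exact_reliability(item_rels):
    """Test reliability from (8): n / ((n - 1) + (1/n) * sum(1 / rho_i))."""
    n = len(item_rels)
    return n / ((n - 1) + sum(1.0 / r for r in item_rels) / n)

def spearman_brown(item_rels):
    """Spearman-Brown value using the arithmetic mean item reliability (9)."""
    n = len(item_rels)
    mean_rel = sum(item_rels) / n
    return n * mean_rel / ((n - 1) * mean_rel + 1.0)

rels = [1.0, 0.5]   # hypothetical item reliabilities
print(exact_reliability(rels), spearman_brown(rels))
```

For these two items the exact value is 0.8 and the Spearman-Brown value is
about 0.857; the two coincide only when all item reliabilities are equal,
since that is when the harmonic and arithmetic means agree.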



3. THE NORMAL OGIVE MODEL 

The classical statistical model described in Section 2 implicitly
assumes that item scores are continuous variables. There is nothing in the
model outlined in Section 2 to make this assumption necessary. Mental test
tradition, however, has assumed that the mental trait level Y is a
continuous random variable. Indeed, tradition further assumes that Y has
a normal distribution. Despite challenges to this tradition (see, for
example, Humphreys (1956)), most mental test theorists continue to adhere
to the view that mental trait levels are continuously (normally) distributed.
If this view is accepted, then a result from probability theory tells us that
the representation (1) for item scores $X_i$ implies that $X_i$ must be a
continuous random variable.

For most of the types of data originally considered by Spearman,
item (or sub-test) scores were continuous variables (or could be thought of
as rounded-off continuous variables). However, the basic items of modern
mental aptitude tests are multiple choice questions. These questions are
customarily scored on a pass-fail, dichotomous basis ($X_i$ = 1 or 0; or if we
correct for guessing, $X_i$ = 1 or -1/k, where k is the number of choices).
Such item scores are clearly not continuous, except to a ridiculously gross
approximation. Hence, we must either drop the "true-score-plus-error" model
(1), or change our assumptions about the continuity of Y.

One attempt to preserve the basic features of the Spearman model,
and yet retain the assumption that Y is normally distributed, is the normal
ogive model. Here, we assume that to answer the $i$-th item in an N-item pool
of dichotomously scored items, an individual calls upon a certain level $X_i$ of
aptitude which is available to him at that point for answering the item.
This aptitude $X_i$ is assumed to be related to the level Y of the underlying
mental trait of interest by a single factor model equivalent to the model in
(1). However, it is additionally assumed that the level Y and the "error"
$E_i$ both are normally distributed variables. To pass the $i$-th item (obtain a
score of $S_i = 1$ on the item), the individual's level of aptitude $X_i$ must
reach a difficulty level $a_i$; otherwise, $S_i = 0$. Hence,

(11)    $S_i = \begin{cases} 1 & \text{if } X_i \ge a_i, \\ 0 & \text{if } X_i < a_i. \end{cases}$
Our interest still is to accurately measure Y for a given individual,
but now the observables are $S_1, S_2, \ldots, S_N$ rather than
$X_1, X_2, \ldots, X_N$.

If we plan to look for a "true-score-plus-error" model for the
item scores, it soon becomes apparent that there is no way to write the $i$-th
item score in a true-score-plus-error form in such a way that the "true"
term depends monotonically upon Y, the error term is independent of Y, and
the two terms are statistically independent. For if such a representation
exists, the average of an infinite number of replications of the item
must equal the "true score". This limiting average, in the present model,
is equal to the conditional probability that the $i$-th item is passed given
Y = y, or

(12)    $P(S_i = 1 \mid Y = y) = 1 - \Phi\!\left(\frac{a_i - y}{\sigma_i}\right),$

where $\Phi(z)$ is the probability that a standard N(0,1) normal variable is
less than or equal to z. The graph of this "true score" against y is an
S-shaped curve called a "normal ogive" (which gives the model the name we
have assigned to it), and by looking at this graph we see that the true score
is indeed monotonic (but non-linear) in the mental trait level Y = y.
Unfortunately, the "error" $S_i - P(S_i = 1 \mid Y = y)$ has (conditional) variance
$\Phi\!\left((a_i - y)/\sigma_i\right)\left[1 - \Phi\!\left((a_i - y)/\sigma_i\right)\right]$
depending upon the "true score" (12), so that
the "error" is not independent of the "true score". Hence, we must be
prepared for consequences of the normal ogive model that seem "paradoxical" in
terms of the "true-score-plus-error" model.
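
To make the dependence of the "error" variance on the trait level concrete,
the following Python sketch evaluates the pass probability (12) and the
conditional variance of the item score for a few trait levels. It is only an
illustration: the function names are hypothetical, and the difficulty 0 and
error standard deviation 1 are arbitrary choices.

```python
import math

def std_normal_cdf(z):
    """Phi(z) for a standard normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pass_probability(y, difficulty, sigma):
    """Formula (12): P(S_i = 1 | Y = y) = 1 - Phi((a_i - y) / sigma_i)."""
    return 1.0 - std_normal_cdf((difficulty - y) / sigma)

def conditional_error_variance(y, difficulty, sigma):
    """Variance of S_i given Y = y: the Bernoulli variance p * (1 - p)."""
    p = pass_probability(y, difficulty, sigma)
    return p * (1.0 - p)

for y in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(y,
          round(pass_probability(y, 0.0, 1.0), 4),
          round(conditional_error_variance(y, 0.0, 1.0), 4))
```

The pass probability rises monotonically with y, while the conditional
variance peaks at 0.25 where y equals the difficulty and shrinks toward zero
in both tails, so the "error" cannot be independent of the "true score".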

For example, under the normal ogive model there is an "attenuation
paradox". To demonstrate this fact, we first point out that items have
maximum reliability when their difficulty level is zero - that is, when they
have .50 probability of being "passed". This assertion is true regardless of
what correlation the required aptitude level $X_i$ has with the underlying mental
trait level Y (see Sitgreaves (1961), Tucker (1946)). Hence, if we forget
that we are dealing with a statistical model in which the "true-score-plus-error"
model is not appropriate, and naively apply the results described in
the previous section, we would decide to set the item difficulties of all of
our items at 0. This would indeed mean that the test score

(13)    $S = \sum_{i \in T} S_i$

would have maximum reliability. Further, item reliabilities (as measured by
the phi coefficient between $S_i$ and its repetition $S_i'$) are monotonically
increasing with the "reliability" coefficient $\rho_{X_i X_i'}$ of the aptitude
$X_i$ called upon to pass the $i$-th item (Sitgreaves (1961)). Since the validity
of the $X_i$'s for measuring Y increases to 1 as the "aptitude reliability"
$\rho_{X_i X_i'}$ increases to 1, this leads us to expect that if the average of the
"aptitude reliability" coefficients,

(14)    $\bar{\rho}_{XX'} = \frac{1}{n} \sum_{i \in T} \rho_{X_i X_i'},$

increases to one, so will the test validity $\rho_{SY}$.

Unfortunately (Tucker (1946), Loevinger (1954), Sitgreaves (1961)),
this conclusion is false. Instead, as the average "aptitude reliability"
$\bar{\rho}_{XX'}$ increases from 0, the test validity coefficient $\rho_{SY}$ at first
rises, then reaches a maximum, and then drops (attenuates) as the average
"aptitude reliability" $\bar{\rho}_{XX'}$ continues to increase.

A A 

The following intuitive explanation for the phenomenon may give
insight into the differences between the normal ogive model and the classical
model. First note (see Gulliksen (1945)) that when the average "aptitude
reliability" $\bar{\rho}_{XX'}$ equals one and all of the item difficulties are zero, then
all of the item scores are 1 if Y is non-negative, and all of the item
scores are 0 if Y is negative. In this case, all of the aptitudes $X_i$
perfectly measure Y, the item scores are perfectly reliable and accurate
measures, but what is actually measured by the item scores is merely the
answer to the question: "Is Y non-negative?" Here, 100 items provide no
more information about Y than does one item, and the information provided
does no more than identify the sign (plus or minus) attached to the magnitude
of Y.

On the other hand, if we permit ourselves to use items which are
less than perfectly reliable, and in fact assign item difficulties of the
form:

(15)    $a_1 = -3, \quad a_2 = -3 + \frac{2}{33}, \quad a_3 = -3 + \frac{4}{33}, \quad \ldots, \quad a_i = -3 + \frac{2(i-1)}{33}, \quad \ldots, \quad a_{100} = 3,$

then when every aptitude $X_i$ is exactly equal to Y, $i = 1, 2, \ldots, 100$, a test
score of S = k, $1 \le k \le 100$, tells us that the first k item scores
$S_1, S_2, \ldots, S_k$ are all 1 and that the last 100-k item scores
$S_{k+1}, S_{k+2}, \ldots, S_{100}$ are all 0. Why? Because $S_i = 1$ if and only if
$Y \ge a_i$, and since $a_1 \le a_2 \le \cdots \le a_{100}$, the fact that Y is greater
than or equal to $a_j$ ($S_j = 1$) implies that $Y \ge a_i$ for all $i \le j$
($S_i = 1$, all $i \le j$), whereas if $Y < a_j$ for some j, then $Y < a_i$ for all
$i \ge j$. In other words, S and Y are monotonically related. Further, if we
know that S = k, $1 \le k < 100$, we can show that
$-3 + 2(k-1)/33 \le Y < -3 + 2k/33$, while S = 0 means that
Y < -3, and S = 100 means that $Y \ge 3$. Clearly these 100 items, although
each is less reliable than the items whose difficulties are all zero,
provide greater validity for measuring Y.
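
The contrast between the two choices of difficulties can be made concrete with
a short Python sketch. It treats the limiting case in which every aptitude
equals Y exactly, builds the equally spaced difficulties of (15), and compares
the resulting score with the score from 100 items of difficulty zero. The
function names and the trait levels tried are hypothetical.

```python
def spread_difficulties(n_items=100, low=-3.0, step=2.0 / 33.0):
    """Equally spaced difficulties a_i = low + (i - 1) * step, as in (15)."""
    return [low + i * step for i in range(n_items)]

def sum_score(y, item_difficulties):
    """Score S when every aptitude X_i equals Y: count of items with y >= a_i."""
    return sum(1 for a in item_difficulties if y >= a)

def trait_interval(score, item_difficulties):
    """Interval [lower, upper) for Y implied by the score; None marks an open end."""
    n = len(item_difficulties)
    lower = item_difficulties[score - 1] if score >= 1 else None
    upper = item_difficulties[score] if score < n else None
    return lower, upper

spread = spread_difficulties()
same = [0.0] * 100          # all difficulties equal to zero
for y in (-1.0, 0.2, 1.7):
    s = sum_score(y, spread)
    # Last column: the score from the all-zero difficulties.
    print(y, s, trait_interval(s, spread), sum_score(y, same))
```

With spread difficulties the score pins Y down to an interval of width 2/33,
whereas with all difficulties at zero the score is either 0 or 100 and reveals
only the sign of Y.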
From the above discussion, we see that if we know that the average
"aptitude reliability" $\bar{\rho}_{XX'}$ is close to one, we are at a disadvantage in
terms of test validity if we must keep item difficulties the same
($a_i = 0$, $i = 1, 2, \ldots, 100$), even though item reliability may be maximized.
Hence, the maxims of Section 2 provide no guideline in this case. If we are
required to set all item difficulties equal to zero, then some other
mechanism is needed to provide information about Y similar to that provided
by the spread-out choices (15) for the $a_i$'s. Amazingly enough, and in contrast
to our use of the word "error", when average "aptitude reliability" is less
than one, the "errors" $E_i = X_i - Y$ provide this mechanism and allow us to
increase test validity. If we replace the word "error" by "randomization",
this result should not surprise statisticians (who know that controlled
randomization in sample survey and experimental design can improve accuracy
of measurement), but it certainly will surprise anyone who is used to
thinking of the error in the "true-score-plus-error" model as a source
of lack of consistency and inaccuracy for measurement of Y. Nevertheless,
in the present model a certain amount of "error" helps improve validity.
Remember that $a_1 = a_2 = \cdots = a_{100} = 0$, and
assume for convenience that $\rho_{X_i X_i'} = \rho$, all i. Looking back at the
definition of $S_i$ (Equation (11)), we see that we can rewrite $S_i$ in the form

(16)    $S_i = \begin{cases} 1 & \text{if } Y + E_i \ge 0, \\ 0 & \text{if } Y + E_i < 0, \end{cases} \;=\; \begin{cases} 1 & \text{if } Y \ge -E_i, \\ 0 & \text{if } Y < -E_i, \end{cases}$

$i = 1, 2, \ldots, 100$. Since Y and $E_i$ are independent, (16) shows that the
$-E_i$'s act as a random allocation of item difficulties for a new normal ogive
model in which the item aptitudes $X_i$ are exactly equal to Y. Since almost
surely the values of $E_1, E_2, \ldots, E_{100}$ are unequal (and in fact are a
sample from a normal distribution with mean 0 and variance $(1-\rho)/\rho$), it
seems reasonable that there is some value of the common "aptitude reliability"
$\rho$ (or range of values for $\rho$) where this random choice of item
difficulties will improve test validity over the fixed choice $a_i = 0$, all i.
If $\rho$ is near 0, the variance of each $E_i$ is nearly infinite, and the random
item difficulties $-E_i$ will be too far spread out to provide much information
about Y. (This can also be seen by remembering that $X_i = Y + E_i$, and noting
that when the variance of $E_i$ is near infinity, Y is basically unobservable.)
For $\rho$ near 1, the variance of each $E_i$ is nearly 0, and thus $E_i$ varies
only very little from its mean of 0, so that this case is essentially that of
fixed item difficulties. Hence, the value of $\rho$ that will allow us to improve
upon the fixed item difficulty, perfect reliability case lies somewhere between
$\rho = 0$ and $\rho = 1$. Mathematically it is found that for a 100-item test,
maximum test validity of $\rho_{SY}$ = .91^9 occurs for an (average) "aptitude
reliability" of $\rho$ = .2268 (Sitgreaves (1961)).
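
The rise-then-fall behaviour of validity can be illustrated with a short
Python sketch. It rests on assumptions that should be stated plainly: Y is
standard normal, the $E_i$ are independent normal variables with variance
(1 - rho)/rho, all difficulties are zero, and validity is taken to be the
correlation between Y and the sum score S. Under those assumptions a standard
bivariate-normal calculation gives Cov($S_i$, Y) = sqrt(rho / (2 pi)) and
Cov($S_i$, $S_j$) = arcsin(rho) / (2 pi), and the closed form in the code
follows. This closed form is a derivation made for this sketch, not a formula
taken from the paper, and the function name is hypothetical. It is meant only
to exhibit the shape of the curve, and is not claimed to reproduce the
numerical values cited from Sitgreaves (1961), whose definitions may differ.

```python
import math

def ogive_test_validity(n_items, rho):
    """Correlation between S = sum of S_i and Y when all difficulties are 0.

    Assumes Y ~ N(0, 1), independent E_i ~ N(0, (1 - rho) / rho), and
    S_i = 1 when Y + E_i >= 0. The closed form is a derivation made for
    this sketch, not a formula quoted from the paper.
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    squared = n_items * rho / (math.pi / 2.0 + (n_items - 1) * math.asin(rho))
    return math.sqrt(squared)

for rho in (0.01, 0.1, 0.3, 0.6, 0.9, 1.0):
    print(rho, round(ogive_test_validity(100, rho), 4))
```

At rho = 1 the function returns sqrt(2/pi), the correlation between Y and its
sign, whatever the number of items, which matches the earlier remark that 100
perfectly reliable items of difficulty zero tell us no more than one. For a
single item the function is increasing in rho, so under these assumptions the
paradox appears only when several items are summed.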

The above discussion provides an example where an acceptance of
"error" helps to improve accuracy of measurement. It also indicates that
deviations from the classical "true-score-plus-error" model, no matter how
seemingly trivial these deviations are, may have major consequences for the
theory and interpretation of indices of test performance. In any testing
problem, therefore, the mental test specialist would do better to base
his methods and conclusions on the statistical model, rather than trusting to
intuition obtained from the "true-score-plus-error" model to guide
his thinking.